[SOUND] This lecture is
about Text Representation.
In this lecture we're going to discuss
text representation and discuss how
natural language processing can allow us
to represent text in many different ways.
Let's take a look at this
example sentence again.
We can represent this sentence
in many different ways.
First, we can always represent such
a sentence as a string of characters.
This is true for all the languages.
When we store them in the computer.
When we store a natural language
sentence as a string of characters.
We have perhaps the most general
way of representing text since
we can always use this approach
to represent any text data.
But unfortunately using such
a representation will not help us to
semantic analysis, which is often needed
for many applications of text mining.
The reason is because we're
not even recognizing words.
So as a string we are going to keep all
of the spaces and these ascii symbols.
We can perhaps count out what's
the most frequent character in
the English text or
the correlation between those characters.
But we can't really analyze semantics, yet
this is the most general way of
representing text because we
hadn't used this to represent
any natural language or text.
If we try to do a little bit more
natural language processing by
doing word segmentation,
then we can obtain a representation
of the same text, but
in the form of a sequence of words.
So here we see that we can identify words,
like a dog is chasing, etc.
Now with this level of representation
we suddenly can do a lot of things.
And this is mainly because words are the
basic units of human communication and
natural language.
So they are very powerful.
By identifying words, we can for
example, easily count what
are the most frequent words in this
document or in the whole collection, etc.
And these words can be
used to form topics.
When we combine related words together and
some words positive and
some words are negatives or
we can also do analysis.
So representing text data as a sequence
of words opens up a lot of interesting
analysis possibilities.
However, this level of representation
is slightly less general than string of
characters.
Because in some languages, such as
Chinese, it's actually not that easy to
identified all the word boundaries,
because in such a language you see
text as a sequence of characters
with no space in between.
So you have to rely on some special
techniques to identify words.
In such a language of course then we
might make mistakes in segmenting words.
So the sequence of words representation
is not as robust as string of characters.
But in English, it's very easy to
obtain this level of representation.
So we can do that all the time.
Now if we go further to do in that round
of processing we can add a part of
these text.
Now once we do that we can count, for
example, the most frequent nouns or
what kind of nouns are associated
with what kind of verbs, etc.
So, this opens up a little bit
more interesting opportunities for
further analysis.
Note that I use a plus sign here because
by representing text as a sequence
of part of speech tags,
we don't necessarily replace
the original word sequence written.
Instead, we add this as an additional
way or representing text data.
So now the data is represented
as both a sequence of words and
a sequence of part of speech tags.
This enriches the representation
of text data, and,
thus also enables a more
interesting analysis.
If we go further,
then we'll be pausing the sentence
to obtain a syntactic structure.
Now this of course will
further open up more
interesting analysis of, for example,
the writing styles or
correcting grammar mistakes.
If we go further for semantic analysis.
Then we might be able to
recognize dog as an animal.
And we also can recognize boy as a person,
and playground as a location.
And we can further
analyse their relations.
For example, dog was chasing the boy,
and boy is on the playground.
This will add more entities and relations,
through entity relation recreation.
At this level,
we can do even more interesting things.
For example, now we can counter
easily the most frequent person
that's managing this whole
collection of news articles.
Or whenever you mention this person
you also tend to see mentioning
of another person, etc.
So this is very a useful representation.
And it's also related to the knowledge
graph that some of you may have heard of
that Google is doing as a more semantic
way of representing text data.
However it's also less
robust sequence of words.
Or even syntactical analysis,
because it's not always easy
to identify all the entities with the
right types and we might make mistakes.
And relations are even harder to find and
we might make mistakes.
This makes this level of representation
less robust, yet it's very useful.
Now if we move further to logic group
condition then we have predicates and
inference rules.
With inference rules we can infer
interesting derived facts from the text.
So that's very useful but
unfortunately, this level of
representation is even less robust and
we can make mistakes.
And we can't do that all the time for
all kinds of sentences.
And finally speech acts would add a yet
another level of rendition of
the intent of saying this sentence.
So in this case it might be a request.
So knowing that would allow us to you
know analyze more even more interesting
things about the observer or
the author of this sentence.
What's the intention of saying that?
What scenarios or
what kind of actions will be made?
So this is, Another role of analysis
that would be very interesting.
So this picture shows that if
we move down, we generally see
more sophisticated and natural language
processing techniques will be used.
And unfortunately such techniques
would require more human effort.
And they are less accurate.
That means there are mistakes.
So if we analyze our text at
the levels that are representing
deeper analysis of language then
we have to tolerate errors.
So that also means it's still necessary
to combine such deep analysis
with shallow analysis based on,
for example, sequence of words.
On the right side, you see the arrow
points down to indicate that
as we go down, with our representation of
text is closer to knowledge representation
in our mind and need for
solving a lot of problems.
Now, this is desirable because as we can
represent text as a level of knowledge,
we can easily extract the knowledge.
That's the purpose of text mining.
So, there was a trade off here.
Between doing deeper analysis
that might have errors but
would give us direct knowledge
that can be extracted from text.
And doing shadow analysis
which is more robust but
wouldn't actually give us the necessary
deeper representation of knowledge.
I should also say that text
data are generated by humans,
and are meant to be consumed by humans.
So as a result, in text data analysis,
text mining,
humans play a very important role.
They are always in the loop,
meaning that we should optimize
a collaboration of humans and computers.
So, in that sense it's okay that
computers may not be able to
have completely accurate
representation of text data.
And patterns that are extracted from
text data can be interpreted by humans.
And then humans can guide the computers to
do more accurate analysis by annotating
more data, by providing features to
guide machine learning programs,
to make them work more effectively.
[MUSIC]

